Skip to content

[core] DataEvolution table supports concurrent updates to different columns#7867

Merged
JingsongLi merged 11 commits into
apache:masterfrom
steFaiz:de_conflict_improve
May 23, 2026
Merged

[core] DataEvolution table supports concurrent updates to different columns#7867
JingsongLi merged 11 commits into
apache:masterfrom
steFaiz:de_conflict_improve

Conversation

@steFaiz
Copy link
Copy Markdown
Contributor

@steFaiz steFaiz commented May 15, 2026

Purpose

Currently, if we concurrently trigger columns update jobs for data evolution table, the later jobs will be failed:
image

This happens even we are updating different columns.
However, we could only forbid updating the same columns. This PR introduces a RowIdColumnConflictChecker and use binary search to accelerate check process.

BenchMark

I tested 100000 files with disjoint row range check against 100000 files with disjoint row range, each file contains 3 columns, the result is:

Iteration Check Files Expected Merged Ranges Check Time
0 100,000 100,000 11 ms
1 100,000 100,000 8 ms
2 100,000 100,000 8 ms
Best 100,000 100,000 8 ms
Average 100,000 100,000 9 ms

with binary search, the check cost is negligible.

Tests

UT Cases for Spark and Unit Tests for core.

@steFaiz steFaiz marked this pull request as draft May 15, 2026 08:49
@steFaiz steFaiz changed the title [core] DataEvolution table supports concurrent updates to different columns [wip][core] DataEvolution table supports concurrent updates to different columns May 15, 2026
@steFaiz steFaiz marked this pull request as ready for review May 15, 2026 13:20
@steFaiz steFaiz changed the title [wip][core] DataEvolution table supports concurrent updates to different columns [core] DataEvolution table supports concurrent updates to different columns May 15, 2026
@steFaiz steFaiz closed this May 18, 2026
@steFaiz steFaiz reopened this May 18, 2026
@steFaiz steFaiz closed this May 18, 2026
@steFaiz steFaiz reopened this May 18, 2026
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a way not to add schemaId and writable Cols to SimpleFileEntry:
Do not add fields to SimpleFileEntry, but build RowIdColumnConflictChecker while still holding the complete ManifestEntry/DataFileMeta, and pass it to the conflict detection method

@steFaiz
Copy link
Copy Markdown
Contributor Author

steFaiz commented May 22, 2026

@JingsongLi Thanks! I think you are right! It's not wise to continuously broaden SimpleEntry. I've refactored the code

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: RowIdColumnConflictChecker for Column-Level Conflict Detection

Nice improvement — the column-aware conflict detection is well-designed. The binary search approach gives excellent performance as shown by the benchmarks, and the field-ID-based comparison correctly handles schema evolution (column renames). A few observations:

Binary Search Correctness

The firstPossibleRange method searches for the first WriteRange whose to >= target.from. However, writeRanges is sorted by (from, to). This works correctly because:

  • If a range's to < target.from, it cannot overlap regardless of its from
  • Finding the first range with to >= target.from is a valid starting point

But there's a subtle gap: after binary search, the linear scan uses writeRange.range.from > range.to as the early-exit condition. Since ranges are sorted by from, this is correct — once from exceeds the query's to, no subsequent ranges can overlap. Well done.

Thread Safety of fieldIdByNameCache

fieldIdByNameCache is a plain HashMap. This is fine as long as RowIdColumnConflictChecker instances are not shared across threads (they're constructed per commit attempt in FileStoreCommitImpl). Worth noting in a comment if the class might be reused in concurrent contexts in the future.

containsAnyWriteField with null writeCols on the checked file

When the committed incremental file has writeCols == null, containsAnyWriteField returns true immediately (line ~179). This is correct and conservative — a full-schema write can conflict with any partial write.

Merged Range Inflation

In testMergesOverlappedDeltaRangesAndWriteColumns: files [0,10] writing "b" and [5,15] writing "c" are merged into a single WriteRange [0,15] with fieldIds {b, c}. This means a file at range [12,12] writing only "b" will be flagged as conflicting, even though the actual "b" write only covers [0,10]. This is an over-approximation — acceptable for correctness (false positive = retry, not data loss), but could lead to unnecessary commit failures when many small partial writes overlap in row ranges but target different columns. The current approach is a reasonable trade-off for simplicity and performance.

baseSnapshotId in DataEvolutionMergeIntoAction

Good fix — pinning baseSnapshotId ensures the scan and conflict check use a consistent view. The .withSnapshot(baseSnapshotId) on the scan in shuffleByFirstRowId prevents TOCTOU issues where new snapshots appear between job planning and execution.

Minor Suggestion

In DataEvolutionPartialWriteOperator, the baseSnapshotId field is declared as Long (boxed). Since it's always assigned from a primitive long and is never null, consider using long to avoid accidental NPE and unnecessary boxing.

Test Coverage

The unit tests cover the key scenarios well (disjoint columns, same columns, rename across schemas, null writeCols, merged ranges, binary search scan-past). The Spark IT test (concurrent merge with disjoint update columns) validates the end-to-end behavior nicely.

Overall: solid design, clean implementation, good test coverage.

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit 55e04ea into apache:master May 23, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants